Search for: All records

Creators/Authors contains: "Chatterjee, Abhijit"


Efficient Optimized Testing of Resistive RAM Based Convolutional Neural Networks

https://doi.org/10.1109/IOLTS60994.2024.10616092

Saha, Anurup; Ma, Kwondo; Amarnath, Chandramouli; Chatterjee, Abhijit (July 2024, IEEE)

Resistive random access memory (RRAM) based memristive crossbar arrays enable low-power, low-latency inference for convolutional neural networks (CNNs), making them suitable for deployment in IoT and edge devices. However, RRAM cells within a crossbar suffer from conductance variations, making RRAM-based CNNs vulnerable to degradation of their classification accuracy. To address this, the classification accuracy of RRAM-based CNN chips can be estimated using predictive tests, where a trained regressor predicts the accuracy of a CNN chip from the CNN’s response to a compact test dataset. In this research, we present a framework for co-optimizing the pixels of the compact test dataset and the regressor. The novelty of the proposed approach lies in the ability to co-optimize individual image pixels, overcoming barriers posed by the computational complexity of optimizing the large numbers of pixels in an image using state-of-the-art techniques. The co-optimization problem is solved using a three-step process: a greedy image downselection followed by backpropagation-driven image optimization and regressor fine-tuning. Experiments show that the proposed test approach reduces the CNN classification accuracy prediction error by 31% compared to the state of the art. A compact test dataset of only 2-4 images suffices for testing, making the scheme suitable for built-in test applications.
Full Text Available
Error Resilient Online Reinforcement Learning Using Adaptive Statistical Checks

https://doi.org/10.1109/TCAD.2025.3529820

Amarnath, Chandramouli; Mejri, Mohamed; Isenberg, Jackson; Chatterjee, Abhijit (August 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Online deep reinforcement learning (deep RL)-based systems are being increasingly deployed in a variety of safety-critical applications. Due to the dynamic nature of the environments they work in, onboard reinforcement learning (RL) hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to on-line error resilience in machine learning systems have relied on the availability of large training datasets to configure resilience parameters. This is not always feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that leverages running statistics of neuron output values collected across the (real-time) RL training process to configure error detection thresholds (called checks) for the deep RL forward pass. Similarly, we formulate checks on the deep RL backward pass using running statistical thresholds on reduced-dimension checksums of online learning weight updates to rapidly detect and correct errors in online deep RL training. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs or weights as erroneous. The use of running statistics allows the checks to adapt to changes caused by continual on-line RL training. Erroneous neurons are set to zero (suppressed) in the forward pass. Erroneous weight updates are frozen, allowing non-erroneous weight updates to proceed and allowing online learning without rerunning training episodes. Our approach is compared against the state of the art and validated on several RL algorithms as well as a hardware validation platform.
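The forward-pass check described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RunningCheck`, the width `k`, the `warmup` length, and the choice of a Chebyshev-style bound (at most 1/k² of values lie more than k standard deviations from the mean) are all assumptions made for illustration. The abstract says only that "statistical concentration bounds" are used.

```python
import math


class RunningCheck:
    """Running mean/variance (Welford's method) with a k-sigma threshold.

    Hypothetical illustration: k and warmup are assumed parameters,
    not values taken from the paper.
    """

    def __init__(self, k=6.0, warmup=10):
        self.k = k            # threshold width, in standard deviations
        self.warmup = warmup  # samples to observe before checks become active
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations from the mean

    def std(self):
        # Sample standard deviation of the values accepted so far.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_erroneous(self, x):
        # No check until enough values have been seen to trust the statistics.
        if self.n < self.warmup:
            return False
        return abs(x - self.mean) > self.k * self.std()

    def update(self, x):
        # Welford's single-pass update of the mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def filter(self, x):
        """Suppress a flagged value to 0.0; otherwise fold it into the statistics."""
        if self.is_erroneous(x):
            return 0.0
        self.update(x)
        return x


check = RunningCheck(k=6.0, warmup=10)
for v in [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9, 1.1, 1.0, 1.0]:
    check.filter(v)          # warm-up: values are accepted and tracked
print(check.filter(1e6))     # a grossly out-of-range value is suppressed
```

Flagged values are deliberately kept out of the running statistics in this sketch, so that a single corrupted output does not widen the threshold for later checks. Because the statistics keep updating with every accepted value, the threshold follows gradual drift in the neuron outputs, which is the adaptive behavior the abstract attributes to running statistics.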
Full Text Available
SiPT: Signature-Based Predictive Testing of RRAM Crossbar Arrays for Deep Neural Networks

https://doi.org/10.1109/TETC.2025.3533895

Ma, Kwondo; Saha, Anurup; Amarnath, Chandramouli; Chatterjee, Abhijit (October 2025, IEEE Transactions on Emerging Topics in Computing)

Resistive Random-Access Memory (RRAM) crossbar array-based Deep Neural Networks (DNNs) are increasingly attractive for implementing ultra-low-power computing for AI. However, RRAM-based DNNs face inherent challenges from manufacturing process variability, which can compromise their performance (classification accuracy) and functional safety. One way to test these DNNs is to apply the exhaustive set of test images to each DNN to ascertain its performance; however, this is expensive and time-consuming. We propose signature-based predictive testing (SiPT), in which a small subset of test images is applied to each DNN and the classification accuracy of the DNN is predicted directly from observations of the intermediate and final layer outputs of the network. This saves test cost while allowing binning of RRAM-based DNNs for performance. To further improve the test efficiency of SiPT, we create an optimized compact set of test images, leveraging image filters and enhancements to synthesize images, and develop a cascaded test structure incorporating multiple sets of SiPT modules trained on compact test subsets of varying sizes. Through experimentation across diverse test cases, we demonstrate the viability of our SiPT framework under RRAM process variations, showing test efficiency improvements.
Full Text Available
Error Resilient Hyperdimensional Computing Using Hypervector Encoding and Cross-Clustering

https://doi.org/10.1109/VTS60656.2024.10538955

Mejri, Mohamed; Amarnath, Chandramouli; Chatterjee, Abhijit (April 2024, IEEE)

Emerging brain-inspired hyperdimensional computing (HDC) algorithms are vulnerable to timing and soft errors in associative memory used to store high-dimensional data representations. Such errors can significantly degrade HDC performance. A key challenge is error correction after an error in computation is detected. This work presents two novel error resilience frameworks for hyperdimensional computing systems. The first, called the checksum hypervector encoding (CHE) framework, relies on creation of a single additional hypervector that is a checksum of all the class hypervectors of the HDC system. For error resilience, elementwise validation of the checksum property is performed and those elements across all class vectors for which the property fails are removed from consideration. For an HDC system with K class hypervectors of dimension D, the second framework, cross-hypervector clustering (CHC), clusters D K-dimensional vectors, the i-th of which consists of the i-th element of each of the K HDC class hypervectors, 1 ≤ i ≤ D. Statistical properties of these vector clusters are checked prior to each hypervector query and all the elements of all K-dimensional vectors corresponding to statistical outlier vectors are removed as before. The choice of which framework to use is dictated by the complexity of the dataset to classify. Up to three orders of magnitude better resilience to errors than the state-of-the-art across multiple HDC high-dimensional encoding (representation) systems is demonstrated.
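The checksum idea in the CHE framework can be sketched in a few lines. The details here are assumptions for illustration, not the paper's construction: the checksum is taken to be the element-wise integer sum of the class hypervectors, similarity is a plain dot product, and the function names (`make_checksum`, `valid_mask`, `classify`) are hypothetical.

```python
def make_checksum(class_hvs):
    """Element-wise sum over all class hypervectors (assumed checksum form)."""
    return [sum(column) for column in zip(*class_hvs)]


def valid_mask(class_hvs, checksum):
    """True at each position where the stored elements still sum to the checksum."""
    return [sum(column) == c for column, c in zip(zip(*class_hvs), checksum)]


def masked_dot(query, hv, mask):
    """Dot-product similarity that skips positions flagged as corrupted."""
    return sum(q * h for q, h, ok in zip(query, hv, mask) if ok)


def classify(query, class_hvs, checksum):
    """Return (best class index, validity mask) using only validated positions."""
    mask = valid_mask(class_hvs, checksum)
    scores = [masked_dot(query, hv, mask) for hv in class_hvs]
    return scores.index(max(scores)), mask


# Two bipolar class hypervectors of dimension 4 (toy sizes for illustration).
clean = [[1, -1, 1, 1], [-1, 1, 1, -1]]
checksum = make_checksum(clean)

# Simulate a memory error in the last element of class 1.
corrupted = [[1, -1, 1, 1], [-1, 1, 1, 50]]
label, mask = classify([1, -1, 1, 1], corrupted, checksum)
print(label, mask)  # → 0 [True, True, True, False]
```

Without the mask, the corrupted element would dominate the dot product and the query would be assigned to class 1. Dropping the failing position from every class vector, as the abstract describes, restores the correct decision at the cost of one dimension.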
Full Text Available
Post-Manufacture Criticality-Aware Gain Tuning of Timing Encoded Spiking Neural Networks for Yield Recovery

https://doi.org/10.1109/ETS61313.2024.10567479

Saha, Anurup; Ma, Kwondo; Amarnath, Chandramouli; Chatterjee, Abhijit (May 2024, IEEE)

Time-to-first-spike (TTFS) encoded spiking neural networks (SNNs), implemented using memristive crossbar arrays (MCA), achieve higher inference speed and energy efficiency compared to artificial neural networks (ANNs) and rate encoded SNNs. However, memristive crossbar arrays are vulnerable to conductance variations in the embedded memristor cells. These degrade the performance of TTFS encoded SNNs, namely their classification accuracy, with adverse impact on the yield of manufactured chips. To combat this yield loss, we propose a post-manufacture testing and tuning framework for these SNNs. In the testing phase, a timing encoded signature of the SNN, which is statistically correlated to the SNN performance, is extracted. In the tuning phase, this signature is mapped to optimal values of the tuning knobs (gain parameters), one parameter per layer, using a trained regressor, allowing very fast tuning (about 150 ms). To further reduce the tuning overhead, we rank order hidden layer neurons based on their criticality and show that adding gain programmability only to 50% of the neurons is sufficient for performance recovery. Experiments show that the proposed framework can improve yield by up to 34% and average accuracy of memristive SNNs by up to 9%.
Full Text Available
DeepER-HD: An Error Resilient HyperDimensional Computing Framework with DNN Front-End for Feature Selection

https://doi.org/10.1109/LATS62223.2024.10534617

Mejri, Mohamed; Amarnath, Chandramouli; Chatterjee, Abhijit (April 2024, IEEE)

Brain-inspired hyperdimensional (HD) computing models mimic cognition through combinatorial bindings of biological neuronal data represented by high-dimensional vectors and related operations. However, the efficacy of HD computing depends strongly on input signal and data features used to realize such bindings. In this paper, we propose a new HD-computing framework based on a co-trainable DNN-based feature extractor pre-processor and a hyperdimensional computing system. When trained with restrictions on the ranges of hypervector elements for resilience to memory access errors, the framework achieves up to 135% accuracy improvement over baseline HD-computing for error-free operation and up to three orders of magnitude improvement in error resilience compared to the state-of-the-art. Results for a range of applications spanning image classification, face recognition, human activity recognition and medical diagnosis are presented and demonstrate the viability of the proposed ideas.
Full Text Available
Learning Assisted Post-Manufacture Testing and Tuning of RRAM-Based DNNs for Yield Recovery

https://doi.org/10.23919/DATE58400.2024.10546505

Ma, Kwondo; Saha, Anurup; Amarnath, Chandramouli; Chatterjee, Abhijit (March 2024, IEEE)

Variability-induced accuracy degradation of RRAM-based DNNs is of great concern due to their significant potential for use in future energy-efficient machine learning architectures. To address this, we propose a two-step process. First, an enhanced testing procedure is used to predict DNN accuracy from a set of compact test stimuli (images). This test response (signature) is simply the concatenated vectors of output neurons of intermediate and final DNN layers over the compact test images applied. DNNs with a predicted accuracy below a threshold are then tuned based on this signature vector. Using a clustering-based approach, the signature is mapped to the optimal tuning parameter values of the DNN (determined using off-line training of the DNN via backpropagation) in a single step, eliminating any (expensive) post-manufacture training of the DNN weights. The tuning parameters themselves consist of the gains and offsets of the ReLU activation of neurons of the DNN on a per-layer basis and can be tuned digitally. Tuning is achieved in less than a second of tuning time, with yield improvements of over 45% with a modest accuracy reduction of 4% compared to digital DNNs.
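The single-step mapping from signature to tuning parameters can be pictured as a nearest-centroid lookup. This is only a sketch under stated assumptions: the abstract does not specify the clustering method, so squared Euclidean distance to precomputed centroids is assumed, and the names `nearest_cluster`, `lookup_tuning`, and `tuned_relu`, as well as the form `max(0, gain * x + offset)` for the tunable activation, are hypothetical.

```python
def nearest_cluster(signature, centroids):
    """Index of the centroid closest to the signature (squared Euclidean distance)."""
    distances = [
        sum((s - c) ** 2 for s, c in zip(signature, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def lookup_tuning(signature, centroids, cluster_params):
    """Map a chip's test signature to the parameters stored for its cluster."""
    return cluster_params[nearest_cluster(signature, centroids)]


def tuned_relu(x, gain, offset):
    """ReLU with a per-layer gain and offset (assumed form of the tuning knobs)."""
    return max(0.0, gain * x + offset)


# Toy data: two clusters found off-line, each with per-layer (gain, offset) pairs.
centroids = [[0.0, 0.0], [1.0, 1.0]]
cluster_params = [
    [(1.0, 0.0), (1.0, 0.0)],     # cluster 0: nominal chip, no correction
    [(1.5, -0.25), (1.25, 0.0)],  # cluster 1: hypothetical correction values
]

params = lookup_tuning([0.9, 1.2], centroids, cluster_params)
gain, offset = params[0]
print(tuned_relu(2.0, gain, offset))  # → 2.75
```

The lookup needs only one pass over the centroids per chip, which is consistent with the abstract's point that tuning avoids any post-manufacture retraining of the weights.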
Full Text Available
A Novel Approach to Error Resilience in Online Reinforcement Learning

Amarnath, Chandramouli; Chatterjee, Abhijit (July 2023, International On-Line Testing Symposium)

Online reinforcement learning (RL) based systems are being increasingly deployed in a variety of safety-critical applications ranging from drone control to medical robotics. These systems typically use RL onboard rather than relying on remote operation from high-performance datacenters. Due to the dynamic nature of the environments they work in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to on-line error resilience in machine learning systems have relied on the availability of large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that makes use of running statistics collected across the (real-time) RL training process to configure error detection thresholds without the need to access a reference training dataset. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs as erroneous. These erroneous neurons are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms involving the use of multiple concentration bounds on CPU as well as GPU hardware.
Full Text Available
A Resilience Framework for Synapse Weight Errors and Firing Threshold Perturbations in RRAM Spiking Neural Networks

Saha, Anurup; Amarnath, Chandramouli; Chatterjee, Abhijit (May 2023, European Test Symposium)

Spiking Neural Networks (SNNs) can be implemented with power-efficient digital as well as analog circuitry. However, in Resistive RAM (RRAM) based SNN accelerators, synapse weights programmed into the crossbar can differ from their ideal values due to defects and programming errors, degrading inference accuracy. In addition, circuit nonidealities within analog spiking neurons that alter the neuron spiking rate (modeled by variations in neuron firing threshold) can degrade SNN inference accuracy when the number of inference time steps (ITSteps) of the SNN is set to a critical minimum that maximizes network throughput. We first develop a recursive linearized check to detect synapse weight errors with high sensitivity. This triggers a correction methodology which sets out-of-range synapse values to zero. For correcting the effects of firing threshold variations, we develop a test methodology that calibrates the extent of such variations. This is then used to proportionally increase inference time steps during inference for chips with higher variation. Experiments on a variety of SNNs prove the viability of the proposed resilience methods.
Full Text Available
Error Resilient Transformer Networks: A Novel Sensitivity Guided Approach to Error Checking and Suppression

Ma, Kwondo; Amarnath, Chandramouli; Chatterjee, Abhijit (May 2023, European Test Symposium)

Transformer networks have achieved remarkable success in Natural Language Processing (NLP) and Computer Vision applications. However, the underlying large volumes of Transformer computations demand high reliability and resilience to soft errors in processor hardware. The objective of this research is to develop efficient techniques for design of error resilient Transformer architectures. To enable this, we first perform a soft error vulnerability analysis of every fully connected layer in Transformer computations. Based on this study, error detection and suppression modules are selectively introduced into datapaths to restore Transformer performance under anticipated error rate conditions. Memory access errors and neuron output errors are detected using checksums of linear Transformer computations. Correction consists of determining output neurons with out-of-range values and suppressing the same to zero. For a Transformer with nominal BLEU score of 52.7, such vulnerability guided selective error suppression can recover language translation performance from a BLEU score of 0 to 50.774 with as much as 0.001 probability of activation error, incurring negligible memory and computation overheads.
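A checksum on a linear computation can exploit the identity that, for y = Wx, the sum of the outputs equals the dot product of the column sums of W with x. The sketch below assumes this column-sum form. The abstract states only that "checksums of linear Transformer computations" are used, so the exact checksum, the tolerance `tol`, the range bounds, and all function names here are hypothetical.

```python
def matvec(W, x):
    """Plain matrix-vector product y = W x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def column_sums(W):
    """Checksum row of W: the sum of each column, computed once off-line."""
    return [sum(column) for column in zip(*W)]


def checksum_ok(y, col_sums, x, tol=1e-6):
    """Verify sum(y) == col_sums . x within a tolerance (assumed value)."""
    expected = sum(c * xi for c, xi in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol


def suppress_out_of_range(y, lo, hi):
    """Set outputs outside [lo, hi] to zero; the bounds are assumed to be profiled."""
    return [v if lo <= v <= hi else 0.0 for v in y]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
col_sums = column_sums(W)

y = matvec(W, x)
print(checksum_ok(y, col_sums, x))   # error-free output passes the check

y_faulty = [y[0], 1e9]               # simulate a corrupted neuron output
if not checksum_ok(y_faulty, col_sums, x):
    y_faulty = suppress_out_of_range(y_faulty, lo=-100.0, hi=100.0)
print(y_faulty)                      # → [3.0, 0.0]
```

The check costs one extra dot product per layer rather than a full recomputation, which fits the abstract's claim of negligible overhead. In floating-point hardware the tolerance would need to account for rounding differences between the two summation orders.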
Full Text Available
